Setup
What is the tidyverse?
The tidyverse consists of a few key packages for data import, manipulation, visualization and more.
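As a minimal sketch, assuming the tidyverse package is already installed, loading it attaches the core packages in one call. `tidyverse_packages()` lists everything the collection covers.

```r
# Attach the core tidyverse packages (dplyr, ggplot2, readr, tidyr, ...)
library(tidyverse)

# Names of all packages that belong to the tidyverse
pkgs <- tidyverse_packages()
head(pkgs)

# After library(tidyverse), dplyr is on the search path
"package:dplyr" %in% search()
```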
Data Structures
Vectors form the basis of R data structures. The two main types are atomic vectors, whose elements all share one type, and lists, whose elements can be of any type.
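A short base R sketch of the difference. The object names and values here are made up for illustration.

```r
# An atomic vector: every element has the same type
my_vector <- c(1, 2, 3)

# A list: elements can differ in type and length
my_list <- list(a = 1, b = "two", c = c(TRUE, FALSE))

is.atomic(my_vector)   # TRUE
is.atomic(my_list)     # FALSE
is.list(my_list)       # TRUE
typeof(my_vector)      # "double"

# Mixing types in an atomic vector coerces everything to one type
mixed <- c(1, "a")
typeof(mixed)          # "character"
```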
Data frames
Data frames are a special kind of list in which every element is a column of the same length. They are probably the most commonly used data structure for data science purposes.
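To see the list nature of a data frame, consider this small sketch. The values are illustrative, and `stringsAsFactors = FALSE` is set only so the result is the same across R versions.

```r
# A small data frame with two columns of equal length
my_data <- data.frame(
  id   = 1:3,
  name = c("Vernon", "Ace", "Cora"),
  stringsAsFactors = FALSE
)

class(my_data)     # "data.frame"
is.list(my_data)   # TRUE: a data frame is a list of columns
length(my_data)    # 2, the number of columns
nrow(my_data)      # 3

# Columns can be extracted just like list elements
my_data$name
my_data[["id"]]
```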
Importing Data
Importing data is usually the first step of an analysis.
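A minimal, self-contained sketch of a CSV import with base R. The file is written to a temporary location first so the example runs anywhere; the column names and values are placeholders, not the workshop data.

```r
# Write a tiny CSV to a temporary file (stand-in for a real data file)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,age,libuser",
    "1,34,1",
    "2,51,0",
    "3,NA,1"),
  csv_path
)

# Read it into a data frame; the literal NA is parsed as a missing value
demographics <- read.csv(csv_path)

str(demographics)
```

The tidyverse equivalent is `readr::read_csv()`, which returns a tibble rather than a plain data frame.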
Working with Databases
A database must first be connected to, but once connected, its tables can be used much like data frames.
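The idea can be sketched with an in-memory SQLite database. This assumes the DBI, RSQLite, and dbplyr packages are installed; the table name `demos` and the data are hypothetical.

```r
library(dplyr)
library(DBI)

# Open a connection to a temporary in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a small local data frame into the database as table "demos"
demos <- data.frame(id = 1:4, age = c(25, 41, 38, 52))
copy_to(con, demos, "demos")

# Refer to the remote table and use ordinary dplyr verbs on it
demos_db <- tbl(con, "demos")
young <- demos_db %>%
  filter(age < 40) %>%
  collect()   # run the query and pull the result into R

dbDisconnect(con)
```

Until `collect()` is called, the work is translated to SQL and done by the database, which is what makes this approach practical for tables too large to load into memory.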
Selecting Columns
A common step is to subset the data by column.
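A sketch of three common ways to pick columns with `select()` from dplyr. The data frame is a made-up stand-in and its values are invented.

```r
library(dplyr)

# Hypothetical stand-in for the workshop data
demographics <- tibble(
  id                 = 1:3,
  gender             = c("F", "M", "F"),
  age                = c(34, 51, 29),
  libuser            = c(1, 0, 1),
  award_total_amount = c(250000, 800000, 120000),
  award_count        = c(1, 3, 1)
)

# Keep columns by name
by_name <- demographics %>% select(gender, age, libuser)

# Drop a column with a minus sign
dropped <- demographics %>% select(-libuser)

# Keep columns whose names match a pattern
awards <- demographics %>% select(starts_with("award"))

names(awards)
```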
Filtering Rows
To filter data, think of a logical statement, something that can be TRUE or FALSE. Rows for which the statement is TRUE are kept.
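A brief sketch with `filter()` from dplyr, using invented data. Note that rows where the condition evaluates to `NA` are dropped as well.

```r
library(dplyr)

# Hypothetical data; the last age is missing
demographics <- tibble(
  id      = 1:4,
  age     = c(34, 51, 29, NA),
  libuser = c(1, 0, 0, 1)
)

# Keep rows where the condition is TRUE
under_40  <- demographics %>% filter(age < 40)
lib_users <- demographics %>% filter(libuser == 1)

# Several conditions separated by commas are combined with AND
young_lib_users <- demographics %>% filter(age < 40, libuser == 1)

nrow(under_40)          # 2: the row with a missing age is not kept
nrow(young_lib_users)   # 1
```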
Generating new data
Another very common data processing task is to generate new variables.
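As one possible illustration, `mutate()` from dplyr can add a standardized version of an existing variable. The data and the name `age_std` are placeholders.

```r
library(dplyr)

# Hypothetical data with one missing age
demographics <- tibble(
  id  = 1:4,
  age = c(34, 51, 29, NA)
)

# Add a new column: age centered at its mean and scaled by its
# standard deviation, ignoring missing values in both summaries
demographics <- demographics %>%
  mutate(age_std = (age - mean(age, na.rm = TRUE)) / sd(age, na.rm = TRUE))

demographics
```

The original columns are kept; `mutate()` only appends the new variable, or overwrites one if an existing name is reused.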
Merging
Merging data can take a variety of forms and, depending on the data, can be quite complicated.
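Two of the most common forms can be sketched with dplyr joins. The tables, the key column `id`, and the `unit` variable are hypothetical.

```r
library(dplyr)

# Two hypothetical tables that share the key column `id`
demographics <- tibble(id = 1:4, age = c(34, 51, 29, 45))
ids          <- tibble(id = c(1L, 2L, 4L, 5L), unit = c("A", "B", "C", "D"))

# Left join: keep every row of the first table;
# unmatched rows get NA in the columns from the second table
left <- left_join(demographics, ids, by = "id")

# Inner join: keep only rows whose key appears in both tables
inner <- inner_join(demographics, ids, by = "id")

nrow(left)    # 4
nrow(inner)   # 3
```

Stating the key explicitly with `by` avoids surprises when the two tables happen to share other column names.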
Exercises
Selecting and filtering
Use the : operator to select successive columns.
Filter the data to award amounts less than 500000.
Generating new data
Generate a new award amount variable that is the log of the original. Give the new variable a useful name.
Python examples
Using Python for data science is not far removed from using R.
Python’s main data processing module is pandas.